Abstract

This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora 
with new tagsets. COMBI-BOOTSTRAP uses existing resources as features for a second level machine learning module, that is trained to 
make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that COMBI-BOOTSTRAP: 
i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7 % error reduction) than both the 
best single tagger and an ensemble tagger constructed out of the same small training sample. 
1. Introduction

When morpho-syntactically annotating a corpus with a
new tagset, the initial stages of the annotation process face
a bootstrapping problem. There are no automatic taggers
available to help the annotator, and because of this, the
annotation process is too laborious to quickly produce
adequate amounts of training material for the tagger. A
solution which has been suggested in previous work (Teufel,
1995; Atwell et al., 1994) is to use an existing tagger and
devise mapping rules between the old and the new tagset.
However, as the construction of such mapping rules
requires considerable linguistic knowledge engineering, this
solution only shifts the problem to a different domain.

In this paper we describe a new method that uses ma- 
chine learning and a very small corpus sample annotated in 
the new tagset. It allows us to exploit existing taggers and 
lexical resources with a wide variation in tagsets to quickly
reach a level of tagging accuracy far beyond that of taggers 
trained on the initially very small annotated samples. 

The idea behind this method, which we will refer to as
COMBI-BOOTSTRAP, comes from previous work on combining
taggers to improve accuracy (Van Halteren et al.,
1998; Van Halteren et al., 2000; Brill and Wu, 1998). These
approaches combine a number of taggers, all trained on
the same corpus data and using the same tagset, to yield a
combined tagger that has a much higher accuracy than the
best component system. The reasoning behind this is that
the components make different errors, and a combination
method is able to exploit these differences. Simple
combination methods, such as (weighted) voting, are confined
to output that i) is in the same tagset as the components,
and ii) is one of the tags suggested by the components.
However, more sophisticated combination methods exist
which do not share these limitations. In Stacking (Wolpert,
1992), the outputs of the component systems are used as
features for a second level machine learning module that is
trained on held-out data to correct the errors that the
components make. First, this theoretically allows the second
level learner to recognize situations where all components
are in error, and to correct these. Second, this lifts the
requirement that the components use the same vocabulary of
categories. We can in effect present the second level learner
with any type of representation of the context to be tagged,
such as the word itself, but also output from existing taggers
with other tagsets. The positive effects of this approach are
demonstrated in the remainder of this paper, which is
structured as follows. In Section 2 we describe the data sets that
are used in the experiments. In Section 3 we describe the
component taggers and the machine learning method used
for the second level learner. In Section 4 we present the
results of our experiments using a variety of combination
setups. Finally, in Section 5 we summarize and conclude.
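
The contrast between voting and Stacking described above can be
illustrated with a minimal sketch. It is not the implementation used
in the experiments: the lookup-table learner, the function names and
the tag strings are illustrative assumptions. The point is only that a
vote can never leave the tagset of the components, whereas a second
level learner can map their outputs to a different tagset.

```python
from collections import Counter


def majority_vote(component_tags):
    """Return the most frequent tag among the component outputs.

    Ties are broken in favour of the earliest component. The result is
    always one of the tags proposed by the components.
    """
    counts = Counter(component_tags)
    best = max(counts.values())
    for tag in component_tags:
        if counts[tag] == best:
            return tag


class LookupStacker:
    """Toy second level learner (an assumption, not the paper's IB1).

    It memorises, for each tuple of component outputs seen in held-out
    data, the most frequent target tag.
    """

    def __init__(self):
        self.table = {}

    def fit(self, cases, targets):
        seen = {}
        for case, target in zip(cases, targets):
            seen.setdefault(tuple(case), Counter())[target] += 1
        self.table = {c: n.most_common(1)[0][0] for c, n in seen.items()}
        return self

    def predict(self, case, default="UNKNOWN"):
        return self.table.get(tuple(case), default)


# Hypothetical component outputs in an old tagset, targets in a new one.
held_out_cases = [("Prep", "Prep", "Adv"), ("Art", "Art", "Art")]
held_out_targets = ["VZ(init)", "LID(bep,stan,rest)"]

stacker = LookupStacker().fit(held_out_cases, held_out_targets)
voted = majority_vote(("Prep", "Prep", "Adv"))      # stays in the old tagset
stacked = stacker.predict(("Prep", "Prep", "Adv"))  # lands in the new tagset
```

Here the vote can only return a tag from the old tagset, while the
stacked prediction is a tag of the new tagset that none of the
components proposed.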



2. Data 

We developed and tested our bootstrapping method in
the context of the morpho-syntactic annotation of the "Corpus
Gesproken Nederlands" (Spoken Dutch Corpus; henceforth
called CGN) (Van Eynde et al., 2000). For this corpus,
a fine-grained tagset was developed that distinguishes
morphological and syntactic features such as number, case,
tense, etc. for a total of approximately 300 tags. Annotation
of this corpus has only just started, so we conducted
experiments on three small samples (of respectively 5, 10
and 20 thousand tokens, including punctuation) of the initial
corpus¹.

As existing Dutch resources we use four popular taggers
(described in Section 3) trained on (parts of) the written
sections of the Eindhoven corpus (Uit den Boogaart, 1975),
tagged with either the WOTAN-1 (347 tags) or WOTAN-LITE
(both with 641424 tokens of training data) or WOTAN-2
(1256 tags, and a slightly more modest 126803 tokens
of training data) (Berghmans, 1994; Van Halteren, 1999)
tagsets. Furthermore we will use the ambiguous lexical
categories² of words taken from the CELEX (Baayen et al.,
1993) lexical database. The section of this database that we
use contains 300837 distinct word forms.

¹ These were annotated by manually correcting tags produced
by the first COMBI-BOOTSTRAP taggers.

² Not including function words like determiners, pronouns, etc.
I.e. adjective, adverb, noun, number, exclamation, verb.

On this data we measure the accuracy of single taggers 
trained on 90% of the data and tested on the remaining 
10%. To test the accuracy of a combined system, the 90% 
training data is split into nine pieces, and the four compo- 
nent taggers are tested on each part in turn (and trained on 
the remaining eight pieces, i.e. nine-fold cross-validation). 
The test outputs of the taggers on the nine training pieces 
are then concatenated and used as training material for the 
second level combination learner, which is tested on the re- 
served 10% test material. When examining the effects of 
including existing resources in the combination, both train 
and test set are tagged using some tagging system (e.g. an 
HMM tagger using WOTAN-1, or the ambiguous lexical 
categories from CELEX), and the effect is measured as the 
accuracy of the second level learner in predicting the target 
CGN tagging for the test set. 
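
The procedure for producing training material for the second level
learner can be sketched as follows. This is a minimal illustration
under stated assumptions: the unigram "tagger" is a hypothetical
stand-in for the real component taggers, and the toy corpus and all
function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_tokens):
    """Hypothetical stand-in for a component tagger.

    Assigns each word its most frequent tag in the training data, and
    the overall most frequent tag to unseen words.
    """
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_tokens:
        per_word[word][tag] += 1
        overall[tag] += 1
    fallback = overall.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lambda word: lexicon.get(word, fallback)


def split_into_pieces(items, n_pieces):
    """Split a list into n contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        pieces.append(items[start:end])
        start = end
    return pieces


def stacking_training_cases(train_tokens, trainers, n_pieces=9):
    """Tag each piece with components trained on the other pieces.

    The concatenated outputs, paired with the gold tags, form the
    training cases of the second level learner.
    """
    pieces = split_into_pieces(train_tokens, n_pieces)
    cases = []
    for i, held_out in enumerate(pieces):
        rest = [tok for j, p in enumerate(pieces) if j != i for tok in p]
        taggers = [train(rest) for train in trainers]
        for word, gold in held_out:
            features = tuple(tagger(word) for tagger in taggers)
            cases.append((features, gold))
    return cases


# Toy corpus of 20 tokens with invented coarse tags.
corpus = [("de", "LID"), ("man", "N"), ("loopt", "WW"), ("de", "LID"),
          ("vrouw", "N"), ("loopt", "WW"), ("voor", "VZ"), ("de", "LID"),
          ("man", "N"), ("uit", "VZ")] * 2
train, test = corpus[:18], corpus[18:]  # 90% / 10% split
cases = stacking_training_cases(train, [train_unigram_tagger])
```

Every training token receives component outputs from taggers that have
not seen it, which is what lets the second level learner observe
realistic component errors.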

3. Systems 

We experimented with four well-known trainable part
of speech taggers: TNT (a trigram HMM tagger; Brants,
2000), MXPOST (a Maximum Entropy tagger; Ratnaparkhi,
1996; henceforth referred to as MAX), the Rule-based
tagger of Brill (1995) (referred to as RUL), and MBT
(a Memory-Based tagger; Daelemans et al., 1996). The
RUL tagger was not available trained on the WOTAN
resources, because its training is too expensive on large
corpora with large tagsets.

As the combination method we have used IB1 (Aha
et al., 1991), a Memory-Based Learning method
implemented in the TiMBL³ (Daelemans et al., 2000) system.
IB1 stores the training set in memory and classifies test
examples by returning the most frequent category in the set of
k nearest neighbors (i.e. the least distant training patterns).
In the experiments below, we use the Overlap distance
metric, no feature weighting, and k = 1.
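
A nearest-neighbor classifier of this kind can be sketched in a few
lines. The sketch below is not TiMBL and does not use its API; the
class and function names are hypothetical, and treating k as the
number of nearest distances (so that all stored cases at a tied
distance vote) is our assumption about tie handling.

```python
from collections import Counter


def overlap_distance(a, b):
    """Unweighted Overlap metric: number of mismatching feature values."""
    return sum(1 for x, y in zip(a, b) if x != y)


class IB1Sketch:
    """Minimal memory-based classifier with the Overlap metric."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, cases, labels):
        # "Training" is just storing all cases in memory.
        self.memory = list(zip(cases, labels))
        return self

    def predict(self, case):
        distances = [(overlap_distance(case, stored), label)
                     for stored, label in self.memory]
        # Keep every stored case at one of the k smallest distances.
        nearest = sorted({d for d, _ in distances})[: self.k]
        votes = Counter(label for d, label in distances if d in nearest)
        return votes.most_common(1)[0][0]


# Hypothetical cases: two component tags plus the word itself.
cases = [("Prep", "Prep", "voor"), ("Art", "Art", "de"), ("Art", "Pron", "de")]
labels = ["VZ", "LID", "LID"]
clf = IB1Sketch(k=1).fit(cases, labels)

pred_a = clf.predict(("Prep", "Adv", "voor"))  # one mismatch with case 0
pred_b = clf.predict(("Art", "Art", "het"))    # one mismatch with case 1
```

Because features are symbolic and compared only for equality, tags
from different tagsets can sit side by side in one case without any
mapping between them.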



4. Results 

4.1. Baselines 

When we train the separate taggers on training sets from
the CGN corpus of three increasing sizes, we obtain the
accuracies shown in Table 1. We also show the percentage
of unknown words in each of the test partitions. Unknown
words are defined as tokens that are not found in the 90%
training partition. From this we can see that the performance
on unknown words is a major component of the
bootstrapping problem. We see that TNT has the best overall
score for all three training set sizes (resp. 84.49, 86.39,
and 90.75% correct). It also has the best scores for known
words. Only for unknown words does it find a serious
contender in MAX. When we do a straightforward combination
of the four taggers in the style of Van Halteren et al. (2000)
with IB1 as the second level learner, we get a combined
tagger with an accuracy of resp. 84.32, 87.24 and 90.46% correct
for the 5k, 10k and 20k data sets. Only for the 10k set





Data set size     5000    10000    20000
CGN              84.32    87.24    90.46
CGN + Word       83.66    87.59    90.46
CGN + CEL        85.64    88.18    91.18
CGN + W1         89.11    90.50    92.39
CGN + WL         88.45    90.24    92.48
CGN + W2         88.94    89.55    91.61

Table 2: The effect of adding existing information sources
one by one.



this is better than the best individual tagger. The reason we
do not obtain accuracy gains as in Van Halteren et al. (1998)
here is probably that the number of training cases for the
second level learner is too small at this data set size. Also,
as was shown in Van Halteren et al. (1998), IB1 is not the
best combiner at small training set sizes. However, to keep
the comparison simple, we will not use weighted voting
combination here (which does perform better at small training
set sizes), because voting approaches cannot be used for
the COMBI-BOOTSTRAP method.

³ Available from http://ilk.kub.nl/

4.2. Combi-bootstrap: Reusing existing resources 

In this section we will add, one by one, a number of
resources that use different tagsets. In contrast to the native
CGN taggers, these resources have much larger lexical
coverage, and the taggers among them have been trained on
much larger corpora (see data description in Section 2). We
will call the resources: CGN, for the block of four CGN
taggers trained in the previous section; Word, for the word
to be tagged itself; and CEL, for the ambiguous categories on
the basis of CELEX. W1, W2, and WL stand for the WOTAN
1, 2 and Lite blocks respectively (each of which contains
three different taggers: MBT, MAX, and TNT). And, finally,
Wall stands for the set of all (nine) WOTAN-based
taggers. The way the resources are added is by including
them as features in the case representation for the second
level learner. Figure 1 illustrates this representation for the
case of all sources being used.
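
How the resource blocks named above turn into one case for the second
level learner can be sketched as follows. The block names follow the
text; the feature keys, the helper function, the placeholder value for
missing information and all tag values are hypothetical.

```python
# Feature keys per resource block; the key names are invented.
BLOCKS = {
    "CGN": ["cgn_mbt", "cgn_tnt", "cgn_max", "cgn_rul"],
    "Word": ["word"],
    "CEL": ["celex"],
    "W1": ["w1_mbt", "w1_max", "w1_tnt"],
    "W2": ["w2_mbt", "w2_max", "w2_tnt"],
    "WL": ["wl_mbt", "wl_max", "wl_tnt"],
}
# Wall is the union of the three WOTAN blocks (nine taggers).
BLOCKS["Wall"] = BLOCKS["W1"] + BLOCKS["W2"] + BLOCKS["WL"]


def build_case(token_info, block_names, missing="UNKNOWN"):
    """Concatenate the features of the selected blocks, in order.

    Features for which a resource has no value get a placeholder.
    """
    return tuple(token_info.get(key, missing)
                 for name in block_names
                 for key in BLOCKS[name])


# Hypothetical resource outputs for the token "de".
token_info = {key: "Art" for key in BLOCKS["Wall"]}
token_info.update({key: "LID(bep,stan,rest)" for key in BLOCKS["CGN"]})
token_info["word"] = "de"  # no CELEX entry: function words are not covered

case = build_case(token_info, ["CGN", "Wall", "CEL", "Word"])
```

Dropping or adding a block only changes the list passed to
`build_case`, which mirrors how the combinations in the experiments
differ from one another.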

First we consider the effects of adding the information
sources one by one to CGN. The results are shown in Table 2.
This shows that every added resource has a positive
effect. The largest improvement is obtained by adding the
WOTAN taggers. Second, we tried to leave out the CGN
block altogether, and test the value of only the other
information sources. This results in the scores shown in Table 3.
Interestingly, we see that the separate existing resources by
themselves are not very good predictors at all. In particular,
CELEX (with only ambiguous main parts of speech)
scores poorly. But also the blocks of three WOTAN taggers
(MAX, TNT, MBT) with the same tagset (either W1, W2
or WL) are worse than the best CGN taggers trained from
scratch. However, this changes when we use the Wall
combination: all 3 (algorithms) times 3 (tagsets) WOTAN
taggers. In fact, this block, together with CELEX and the
word itself, performs better (92.82% at 20k) than the best
CGN + WOTAN combination so far (92.48%). These results
also show that CELEX and Word are valuable additions,





Data set size          5000                  10000                  20000
tagger             u      k      t       u      k      t       u      k      t
MBT              39.42  90.84  82.01   46.25  91.57  85.36   45.93  93.03  88.29
TNT              49.04  91.83  84.49   50.00  92.16  86.39   57.42  94.48  90.75
MAX              50.00  79.48  74.42   58.13  86.21  82.36   57.42  90.35  87.04
RUL              29.81  87.65  77.72   37.50  87.50  80.65   40.19  89.71  84.72
CGN ensemble                   84.32                 87.24                 90.46
% unknown                      17.16                 13.69                 10.07

Table 1: Test set accuracies for taggers trained on 90% of the CGN data and tested on 10%. The accuracies for the single
taggers are given separately for unknown (u), known (k), and all (t) tokens. The bottom row gives the percentage of
unknown words for the test partition.
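
The definition of unknown words, and the way the three accuracy
columns of Table 1 relate to each other, can be made explicit with a
small sketch. The function names are ours; the check at the end uses
the TNT figures from Table 1 and assumes that the total accuracy is the
average of the unknown and known accuracies, weighted by the
proportion of unknown tokens.

```python
def unknown_rate(train_tokens, test_tokens):
    """Percentage of test tokens that do not occur in the training data."""
    known = set(train_tokens)
    n_unknown = sum(1 for token in test_tokens if token not in known)
    return 100.0 * n_unknown / len(test_tokens)


def total_accuracy(acc_unknown, acc_known, pct_unknown):
    """Combine per-group accuracies (in %) into an overall accuracy."""
    p = pct_unknown / 100.0
    return p * acc_unknown + (1.0 - p) * acc_known


# Toy token lists (invented) for the unknown-word definition.
rate = unknown_rate(["de", "man", "loopt"], ["de", "vrouw", "loopt", "uit"])

# TNT at the 5k and 20k data set sizes, figures taken from Table 1.
tnt_5k = total_accuracy(49.04, 91.83, 17.16)
tnt_20k = total_accuracy(57.42, 94.48, 10.07)
```

Up to rounding, the recomputed totals agree with the reported 84.49
and 90.75, which also shows how strongly the low unknown-word
accuracy pulls down the overall score at the smallest data set size.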
Figure 1: An example of case representations for the second level learner with all information sources as features.





Data set size          5000    10000    20000
Word                  73.10    75.60    80.05
CEL                   25.74    27.40    29.49
W1                    81.35    82.45    82.65
WL                    78.38    77.31    77.35
W2                    83.83    86.64    86.89
Wall                  90.10    91.01    91.47
Wall + CEL            90.10    91.01    91.47
Wall + Word           90.92    91.52    92.43
Wall + CEL + Word     91.25    91.52    92.82

Table 3: The effect of the information sources without the
contribution of the CGN block.



even though they are poor predictors by themselves. 

Finally, we threw all the information sources together
in the combiner. This has a further positive effect, as can
be seen in Table 4. In fact, it seems that more sources is
simply better⁴. The best result (93.49% correct with all
information sources at 20k data set size) shows 2.74
percentage points fewer errors than the best single CGN
tagger, a 29.6% error reduction. The error reduction is even
larger for smaller data set sizes, as can be seen in Table 5.
In this table, the error
reduction is also shown separately for known and unknown
words. The gain for unknown words is dramatically larger
than that for known words, showing that the effect of our
method can mostly be attributed to the larger lexical
coverage of the existing resources. Further analysis would be
needed to separate this from the effect of better "unknown
word guessing" of the existing taggers.

Data set size                5000    10000    20000
CGN + Wall                  91.25    91.44    93.40
CGN + Wall + Word           91.42    91.44    93.35
CGN + Wall + CEL            91.25    91.78    93.45
CGN + Wall + CEL + Word     91.42    91.70    93.49

Table 4: The effect of large combinations. The boldface
figures indicate the best results overall from this paper.

⁴ We have, however, not tried to check this exhaustively by
leaving out single CGN or WOTAN taggers.
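
The error reduction figures quoted in the text and listed in Table 5
follow from the reported accuracies by simple arithmetic. The sketch
below, whose function name is ours, recomputes the reductions for the
total (t) column from the TNT and best COMBI-BOOTSTRAP accuracies.

```python
def error_reduction(baseline_acc, new_acc):
    """Relative reduction in error rate (in %), from accuracies in %."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# (data set size, best single CGN tagger, best COMBI-BOOTSTRAP), from Table 5.
reported = [(5000, 84.49, 91.42), (10000, 86.39, 91.70), (20000, 90.75, 93.49)]
reductions = {size: round(error_reduction(tnt, combi), 1)
              for size, tnt, combi in reported}
# reductions == {5000: 44.7, 10000: 39.0, 20000: 29.6}
```

At 20k, for instance, the error rate drops from 9.25% to 6.51%, a
difference of 2.74 percentage points, which is 29.6% of the original
error.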

Because the combination of all information sources
contains sources of a very diverse character, a plausible
intuition would be that feature weighting could help the
Memory-Based classifier. However, further experimentation
with TiMBL parameters showed that no parameter setting
had a significant gain over unweighted Overlap with
k = 1 for this data set. This would probably be different if
we had more data to train the combiner on. However, such
luxury is not typical of the main application context of the
proposed method.

Data set size                   5000                  10000                  20000
tagger                      u      k      t       u      k      t       u      k      t
best single CGN (TNT)     49.04  91.83  84.49   50.00  92.16  86.39   57.42  94.48  90.75
best COMBI-BOOTSTRAP      75.00  94.82  91.42   78.13  93.45  91.70   76.08  95.44  93.49
Δ error (%)               -50.9  -36.6  -44.7   -56.3  -16.5  -39.0   -43.8  -17.4  -29.6

Table 5: Accuracy of the best COMBI-BOOTSTRAP system (the one using all information sources) and the best individual
tagger trained only on the CGN data, and the associated percentage of error reduction. The scores are split out into unknown
(u) and known (k) words, and total (t).



5. Conclusion 

We have described COMBI-BOOTSTRAP, a new method 
for bootstrapping the annotation of a corpus with a new 
tagset from existing information sources in the same lan- 
guage and very small samples of hand-annotated material. 
COMBI-BOOTSTRAP is based on the principle of Stacking 
machine learning algorithms, and shows very good perfor- 
mance on the CGN corpus that we have experimented with. 
The best performance was obtained when all available in- 
formation sources are used at the same time, which yields 
an error reduction of up to 44.7% in one case. As the test 
samples are very small, however, further experimentation 
will be needed on other corpora. 

Most importantly, we have shown that if existing re- 
sources are available, a tagger for a new corpus and tagset 
can quickly be lifted into a workable accuracy-range for 
manual correction. Moreover, the proposed method seems 
promising for application in other domains such as word 
sense disambiguation or parsing, where large training
resources are difficult to construct and existing representation
schemes are very diverse. 